Automatic indexing of audio using speech recognition.
A method of automatically aligning a written transcript
with speech in video and audio clips. The
disclosed technique involves as a basic component 
an automatic speech recognizer. The automatic 
speech recognizer decodes speech (recorded on a 
tape) and produces a file with a decoded text. This 
decoded text is then matched with the original writ- 
ten transcript via identification of similar words or 
clusters of words. The result of this matching is an
alignment of the speech with the original transcript.
The method can be used (a) to create indexing of 
video clips, (b) for "teleprompting" (i.e. showing the 
next portion of text when someone is reading from a 
television screen), or (c) to enhance editing of a text 
that was dictated to a stenographer or recorded on a 
tape for its subsequent textual reproduction by a 
typist. 
Background of the Invention

This invention relates generally to a system for 
indexing of audio or audio-video recordings and 
textual data, for example aligning texts that are 
stored in computer files with corresponding data 
that are stored on audio-video media, such as audio
tape, video tape, or video disk. The typical problem 
in this area can be formulated as follows. 

Consider an audio-video recording and its writ- 
ten transcript. To index the video, it is necessary to 
know when words appearing on the transcript were 
spoken. To find an appropriate part of the record- 
ing, we need a text-speech index containing data 
pairs for each word in the transcript. Each data pair 
consists of a word in the transcript and the
f-number describing the position of the word on the
tape. Each data pair can be represented as (word,
f-number).

We will use the term "word" to refer both to 
single words such as "dog", "step", or "house", 
and to phrases such as "United States of Amer- 
ica", "production of wheat", etc. 

Indexing an audio-video recording by text en- 
hances one's ability to search for a segment of the 
audio recording. It is often faster to manually or 
automatically search for a segment of text than it is 
to search for a segment of audio recording. When 
the desired text segment is found, the correspond- 
ing audio recording can be played back. 

Indexing an audio recording by text also en- 
hances one's ability to edit the audio recording. By 
moving or deleting words in the text, the cor- 
responding audio segments can be moved or de- 
leted. If there is maintained a vocabulary of stored 
words and stored audio segments corresponding to 
the words, then when words are inserted in the 
text, the corresponding audio segments can be 
inserted in the audio recording. 

Two example applications where it is neces- 
sary to align speech with a corresponding written 
transcript are (1) providing subtitles for movies, and 
(2) fast retrieval of audio-video data recorded at 
trial from a stenographic transcript by an appellate 
court or a deliberating jury. 

A conventional approach to aligning recorded
speech with its written transcript is to play back the 
audio data, and manually select the corresponding 
textual segment. This process is time consuming 
and expensive. 

Other work deals with relationships (or syn- 
chronization) of speech with other data (e.g. facial 
movements) that are time aligned. For example,
U.S. Patent No. 5,136,655 (Branson) discloses the
indexing of different data (words and animated pic- 
tures). There, the files with aligned words and 
pictures were obtained by a simultaneous decoding 
of voice by an automatic speech recognizer and of
time aligned video data by an automatic pattern
recognizer. In another example, U.S. Patent No.
5,149,104 (Edelstein), audio input from a player is
synchronized with a video display by measuring
the loudness of a speaker's voice.

While these methods provide some kind of 
automatic annotation of audio-video data, they are
still not well suited for indexing of stored speech 
and textual data that are not time correlated. 


Summary of the Invention 

It is an object of the invention to automatically
map an index text to corresponding parts of an
audio or audio/video recording.

This object is solved basically by the features 
as laid down in the independent claims. Further 
advantageous embodiments of the present inven- 
tion are laid down in the subclaims. 

According to the invention, an apparatus for
indexing an audio recording comprises an acoustic
recorder for storing an ordered series of acoustic
information signal units representing sounds generated
from spoken words. The acoustic recorder
has a plurality of recording locations. Each recording
location stores at least one acoustic information
signal unit.

The indexing apparatus further includes a
speech recognizer for generating an ordered series
of recognized words having a high conditional
probability of occurrence given the occurrence of
the sounds represented by the acoustic information
signals. Each recognized word corresponds to at
least one acoustic information signal unit. Each
recognized word has a context of at least one
preceding or following recognized word.

A text storage device stores an ordered series
of index words. The ordered series of index words
comprises a visual representation of at least some
of the spoken words represented by the acoustic
information signal units. Each index word has a
context of at least one preceding or following index
word.

Means are provided for comparing the ordered
series of recognized words with the ordered series
of index words to pair recognized words and index
words which are the same word and which have
matching contexts. Each paired index word is
tagged with the recording location of the acoustic
information signal unit corresponding to the recognized
word paired with the index word.

In one aspect of the invention, each recognized
word comprises a series of one or more characters.
Each index word comprises a series of one or
more characters. A recognized word is the same as
an index word when both words comprise the same
series of characters.
The context of a target recognized word may
comprise, for example, the number of other recog- 
nized words preceding the target recognized word 
in the ordered series of recognized words. The 
context of a target index word may comprise, for 
example, the number of other index words preced- 
ing the target index word in the ordered series of 
index words. The context of a recognized word 
matches the context of an index word if the context 
of the recognized word is within a selected thresh- 
old value of the context of the index word. 

In another aspect of the invention, each index 
word which is not paired with a recognized word 
has a nearest preceding paired index word in the 
ordered series of index words, and has a nearest 
following paired index word in the ordered series of 
index words. The comparing means tags a non- 
paired index word with a recording location be- 
tween the recording location of the nearest preced- 
ing paired index word and the recording location of 
the nearest following paired index word. 

Preferably, the speech recognizer aligns each 
recognized word with at least one acoustic informa- 
tion signal unit. 

In a further aspect of the invention, each recog- 
nized word which is not paired with an index word 
has a nearest preceding paired recognized word in 
the ordered series of recognized words, and has a 
nearest following paired recognized word in the 
ordered series of recognized words. The context of 
a target recognized word comprises the number of 
other recognized words preceding the target recog- 
nized word and following the nearest preceding 
paired recognized word in the ordered series of 
recognized words. The context of a target index 
word comprises the number of other index words 
preceding the target index word and following the 
nearest preceding paired index word in the ordered
series of index words. The context of a
recognized word matches the context of an index 
word if the context of the recognized word is within 
a selected threshold value of the context of the 
index word. 

Brief Description of the Drawings 

Figure 1 is a block diagram of an example of 
an apparatus for indexing an audio recording ac- 
cording to the invention. 

Figure 2 is a block diagram of how procedures
and data in the proposed invention are related.

Figure 3 is a block diagram of an example of a
system for automatically aligning text and audio/video
recordings.

Figure 4 is a block diagram of an example of 
the mapping module of Figure 2. 

Figure 5 schematically shows the alignment of 
audio/video data and decoded text data. 



Figure 6 schematically shows how the speech 
recognizer vocabulary may be obtained from seg- 
ments of the text transcript. 

Description of the Preferred Embodiments

Figure 1 is a block diagram of an example of
an apparatus for indexing an audio recording according
to the invention. The apparatus comprises
an acoustic recorder 70 for storing an ordered
series of acoustic information signal units representing
sounds generated from spoken words. The
acoustic recorder has a plurality of recording locations.
Each recording location stores at least one
acoustic information signal unit.

The acoustic recorder 70 may be, for example, 
a magnetic tape or disk storage unit for a computer 
system. 

The indexing apparatus further comprises a
speech recognizer 72 for generating an ordered
series of recognized words having a high conditional
probability of occurrence given the occurrence
of the sounds represented by the acoustic
information signals. Each recognized word corresponds
to at least one acoustic information signal
unit. Each recognized word has a context of at
least one preceding or following recognized word.

The speech recognizer 72 may be a computerized
speech recognition system such as the IBM
Speech Server Series.

A text storage device 74 is provided for storing
an ordered series of index words. The ordered
series of index words comprises a visual representation
of at least some of the spoken words
represented by the acoustic information signal
units. Each index word has a context of at least
one preceding or following index word.

The text storage device 74 may be, for example,
a magnetic tape or disk storage unit for a
computer system.

Finally, the indexing apparatus further comprises
a comparator 76 for comparing the ordered
series of recognized words with the ordered series
of index words to pair recognized words and index
words which are the same word and which have
matching contexts. The comparator 76 also tags
each paired index word with the recording location
of the acoustic information signal unit corresponding
to the recognized word paired with the index
word.

The comparator 76 may be, for example, a
suitably programmed digital signal processor. 

Each recognized word and each index word
comprises a series of one or more characters. The
comparator 76 determines that a recognized word
is the same as an index word when both words
comprise the same series of characters.
The context of a target recognized word may,
for example, comprise the number of other recog- 
nized words preceding the target recognized word 
in the ordered series of recognized words. The 
context of a target index word may, for example, 
comprise the number of other index words preced- 
ing the target index word in the ordered series of 
index words. The context of a recognized word 
matches the context of an index word if the context 
of the recognized word is within a selected thresh- 
old value of the context of the index word. 

Each index word which is not paired with a 
recognized word has a nearest preceding paired 
index word in the ordered series of index words, 
and has a nearest following paired index word in 
the ordered series of index words. The comparator 
76 may tag a non-paired index word with a record- 
ing location between the recording location of the 
nearest preceding paired index word and the re- 
cording location of the nearest following paired 
index word. 
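
The tagging of non-paired index words can be illustrated with a minimal sketch. The text only requires a location between those of the two neighbouring paired words; the even (linear) spacing used here, and the function name `fill_unpaired_locations`, are assumptions made for illustration.

```python
from typing import List, Optional


def fill_unpaired_locations(locations: List[Optional[float]]) -> List[Optional[float]]:
    """Assign a recording location to each unpaired index word.

    locations[i] is the recording location of index word i, or None when the
    word was not paired with a recognized word. Unpaired words that lie
    between two paired words receive a location between the two.
    """
    filled = list(locations)
    paired = [i for i, loc in enumerate(locations) if loc is not None]
    for left, right in zip(paired, paired[1:]):
        gap = right - left
        for i in range(left + 1, right):
            # Assumption: spread the unpaired words evenly between neighbours.
            fraction = (i - left) / gap
            filled[i] = locations[left] + fraction * (locations[right] - locations[left])
    # Words before the first or after the last paired word stay untagged,
    # because they lack one of the two neighbours the text relies on.
    return filled


if __name__ == "__main__":
    print(fill_unpaired_locations([10.0, None, 20.0, None, None, 50.0, None]))
```

Any other rule that keeps the location between the two neighbours (for example, weighting by word length) would satisfy the description equally well.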

In order to correlate each recognized word with 
at least one acoustic information signal unit, the 
speech recognizer preferably aligns each recog- 
nized word with at least one acoustic information 
signal unit. 

Each recognized word which is not paired with 
an index word has a nearest preceding paired 
recognized word in the ordered series of recog- 
nized words, and has a nearest following paired 
recognized word in the ordered series of recog- 
nized words. 

In one embodiment of the invention, the con- 
text of a target recognized word may, for example, 
comprise the number of other recognized words 
preceding the target recognized word and following 
the nearest preceding paired recognized word in 
the ordered series of recognized words. The con- 
text of a target index word may, for example, 
comprise the number of other index words preced- 
ing the target index word and following the nearest 
preceding paired index word in the ordered
series of index words. The context of a recognized 
word matches the context of an index word if the 
context of the recognized word is within a selected 
threshold value of the context of the index word. 

Figure 2 describes schematically procedural
modules and data entries. Main entry data in this
process are audio/video data 101 that enter the 
decoding module (automatic speech recognizer) 
103 and the reference transcript data 102. The 
reference transcript data 102 represent the text 
(exact or approximate) of the audio data in 
audio/video data 101. The audio data is processed 
by decoding module 103 and a decoded output 
(recognized words) 104 is produced. The decoding
module 103 may be a computerized speech recognition
system such as the IBM Speech Server
Series (trademark) or the IBM Continuous Speech
Series (trademark).

The decoded output 104 and the reference
transcript 102 are matched in the comparing module
105. Matching words in reference transcript 102
and in decoded output 104 provide the correspondence
between transcript 102 and output 104. All words in the decoded output
104 are time stamped by time aligner 106 while the
audio data is decoded in 103. The same time
stamps are provided for the corresponding words
in reference transcript 102. The time stamped reference
transcript 102 is used to form index data
107. A user 108 can utilize the indexing to retrieve 
and play back selected recorded audio/video data. 

Figure 3 is a block diagram of an example of a
system for automatically aligning text and audio/video
recordings.

The system of Figure 3 comprises a recording
medium 12 which stores at least audio data. Recording
medium 12 may also store, for example,
video data. The audio or audio-video data may be
recorded as either analog or digital signals.

A text store 24 contains a transcript of the text
of the speech contained in the audio data on recording
medium 12. Transcripts may be produced,
for example, by a typist who reproduces a text by
listening to the audio recording and typing the
words spoken therein. Alternatively, the text may
be typed (for example by a stenographer) at the
same time the audio is recorded. Instead of typing,
the text may be produced using an automatic
handwriting recognizer, or a speech recognizer
trained to the voice of a speaker who listens to the
audio recording and redictates the words spoken
therein.

In the conventional approach for replaying a
preselected portion of audio-video data recording
12, the audio-video data would typically be monitored
while being recorded or while being retrieved
after earlier storage, e.g. on a record/playback deck
19 connected to a monitor 62. In such a conventional
approach the transcript would also be viewed
on the monitor 62 that is connected with the text
store 24. In this conventional approach the transcript
is manually aligned with the video-audio data
recording 12.

In the present invention the audio data is processed
via an automatic speech recognizer (ASR)
34 that is connected with the record/playback deck
19. The output of the ASR 34 is the decoded text
38. This decoded text is time-aligned with the
audio data that is stored in 40 (and that is the same
as a corresponding portion of audio data on recording
medium 12).

Notice that the audio data is used several
times. First, the audio data is passed to the decoder.
Second, a part of the audio data is aligned
with the decoded text. Figure 3 shows that this part
of the audio data is used for alignment in the
decoded block.

For this purpose, the audio data from the main
storage in 12 should be copied to temporary storage
40, where it is aligned with the text.

This time-alignment is obtained by the follow- 
ing operations. 

First, the audio data is time stamped while it is 
being recorded on deck 19. This time stamping is 
done by timer 16, and represents the recording 
location of the audio data. For example, if the audio 
data is divided into frames of 10 milliseconds dura- 
tion, then the frames are sequentially time stamped 
with numbers N/100 seconds, where N is a positive integer.
Alternatively, if the audio recording is made on magnetic
recording tape, the time stamp for an audio seg- 
ment may represent the length of the recording 
tape from the start of the recording to the audio 
segment. 
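
As a small worked example of the frame time stamps described above, the following sketch converts a frame number N into seconds for 10 millisecond frames. The function name `frame_time_stamp` is hypothetical and chosen only for illustration.

```python
FRAME_MS = 10  # frame duration in milliseconds, as in the example in the text


def frame_time_stamp(n: int, frame_ms: int = FRAME_MS) -> float:
    """Return the time stamp in seconds of the n-th frame (n is a positive integer)."""
    if n < 1:
        raise ValueError("frame numbers start at 1")
    # With 10 ms frames this is N/100 seconds.
    return n * frame_ms / 1000


if __name__ == "__main__":
    for n in (1, 100, 250):
        print(n, frame_time_stamp(n))
```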

Secondly, the time stamped audio data pro- 
vides a basis for time stamping of decoded words. 
For this purpose, the time stamped data is passed 
to automatic speech recognizer 34 and stored in 
40. The procedure of time stamping of decoded 
words in the text 38 is done by the time-alignment 
device 42. It works according to the following
algorithms.

(Algorithm A) The identification of the probable 
beginning and end of a speech segment F that 
corresponds to the i-th part of the whole text - Ti. 

This identification is performed in two steps. 

(1) Let T1, T2, ..., Ti, ... partition the whole text.

The input is an i-th speech segment F (that is
stored in 40), and the decoded text Ti is output
by automatic speech recognizer 34. Ti is produced
by the ASR 34 when it decodes the audio
recording segment F. The decoded text Ti is the
text that maximizes the likelihood score Prob(Ti|F),
where Prob(Ti|F) is the conditional probability
of decoded text Ti given the occurrence
of the recorded audio segment F.

Let the speech segment F be given as a
collection of frames F1, F2, ...:
F = {F1, F2, F3, ..., FK}. Each frame may be, for
example, of 10 milliseconds duration. An acoustic
information signal unit consists of one or
more recorded frames. Therefore, each decoded
word corresponds to one or more recorded
frames.

(2) A set of candidate frames F(k-1), F(k-2), ...,
F(k+1), F(k+2), ... near F(k) are considered to
find the most likely beginning of the first word W
in the text. The most likely candidate frames Fr
can be chosen as those that give the largest
value for the following expression: P = Prob(Fr,
F(r+1), ..., Fl | W)/N1, where N1 is a normalizing
factor (to ensure that we have functions with
peaks), and frames Fr are chosen close to the
frame Fk found in the previous step, and for
each fixed frame Fr, the frame Fl is chosen as
that for which the expression P has a peak as
a function of l.
This time alignment in both steps can be done
efficiently using the Viterbi alignment. (See, for
example, L.R. Bahl, F. Jelinek, R.L. Mercer, "A
Maximum Likelihood Approach to Continuous
Speech Recognition", IEEE Transactions on Pattern
Analysis and Machine Intelligence, Vol. PAMI-5, March
1983, pages 179-190.)
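
The idea behind a Viterbi alignment of frames to decoded words can be sketched in a heavily simplified form. The sketch assumes one state per word (real recognizers align against sequences of sub-word model states) and a hypothetical matrix `logp[t][w]` giving the log-likelihood of frame t under word w; neither assumption comes from the text.

```python
from typing import List, Tuple

NEG_INF = float("-inf")


def viterbi_align(logp: List[List[float]]) -> List[int]:
    """Assign each frame to a word so that words occur in order.

    logp[t][w] is the (assumed) log-likelihood of frame t given word w.
    Every word receives at least one frame. Returns the word index per frame.
    """
    num_frames, num_words = len(logp), len(logp[0])
    if num_frames < num_words:
        raise ValueError("need at least one frame per word")
    score = [[NEG_INF] * num_words for _ in range(num_frames)]
    back = [[0] * num_words for _ in range(num_frames)]
    score[0][0] = logp[0][0]  # the first frame must belong to the first word
    for t in range(1, num_frames):
        for w in range(num_words):
            stay = score[t - 1][w]                              # remain in word w
            advance = score[t - 1][w - 1] if w > 0 else NEG_INF  # move on from w-1
            if advance > stay:
                score[t][w], back[t][w] = advance + logp[t][w], w - 1
            else:
                score[t][w], back[t][w] = stay + logp[t][w], w
    # Trace back from the last word at the last frame.
    path = [num_words - 1]
    for t in range(num_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def word_spans(path: List[int]) -> List[Tuple[int, int]]:
    """Convert a per-frame word assignment into (first_frame, last_frame) per word."""
    spans = {}
    for t, w in enumerate(path):
        first, _ = spans.get(w, (t, t))
        spans[w] = (first, t)
    return [spans[w] for w in sorted(spans)]


if __name__ == "__main__":
    # Two words, four frames: frames 0-1 fit word 0, frames 2-3 fit word 1.
    scores = [[-1.0, -5.0], [-1.0, -5.0], [-5.0, -1.0], [-5.0, -1.0]]
    print(word_spans(viterbi_align(scores)))
```

The resulting frame spans are what allow each decoded word to be time stamped with the frame numbers of its beginning and end.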

This algorithm could also use some criteria for
rejecting bad alignments, i.e. alignments that produce
low likelihood scores for all possible candidate
words for a considered list of speech segments.
In such a case, the alignment should be
done only for those parts that give good likelihood
scores. Parts of the audio that were rejected by
these criteria could be time stamped from other
considerations, such as the length of
words, relative speed of speaking, etc. If 'rejected'
intervals are relatively short, then these mechanical
methods give a good approximation for time stamping
of each word in the text. Also, one can iteratively
continue refining the segmentation of the frame
string given a decoded text. That is, given a
speech segment, decode the text. Then, given the
decoded text, redefine the speech segment by
identifying the most probable beginnings and endings
of the decoded words. The process can then
be repeated with the redefined speech segment.

In the present invention, another variant of the
mode of operation of the automatic speech recognizer
34 is possible. Namely, the automatic speech
recognizer may receive information about the content
of the text store 24 from mapping block 44.
This content defines the part and size of a tape to
be played to the automatic speech recognizer 34
and affects the decoding as described in Figure 4
below.

The reference transcript in text store 24 constrains
the work of automatic speech recognizer 34
by determining the set of possible sentences. The
size of the reference text also determines the maximum
size of a speech segment to be considered:
not longer than the number of words in text Ti
times the average number of frames in long words.
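
The bound on the speech segment size is simple arithmetic, sketched below. The figure of 80 frames for a long word is a hypothetical value chosen for illustration, as are the function names.

```python
def max_segment_frames(num_words: int, avg_frames_per_long_word: int) -> int:
    """Upper bound on the number of frames passed to the recognizer for text Ti."""
    return num_words * avg_frames_per_long_word


def frames_to_seconds(frames: int, frame_ms: int = 10) -> float:
    """Convert a frame count to seconds, assuming 10 ms frames."""
    return frames * frame_ms / 1000


if __name__ == "__main__":
    # A 15-word partition with an assumed 80 frames (0.8 s) per long word.
    bound = max_segment_frames(15, 80)
    print(bound, frames_to_seconds(bound))
```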

In the present invention, the decoded text 38
and the transcript in text store 24 are provided to a
mapping module. In the mapping module the decoded
text and the transcript are matched according
to the block scheme of Figure 3 (described
below).
The aligned decoded text 38 and reference
text in store 24 from the mapping module 44 are
passed to the block 48. In the block 48, the audio 
data is aligned against the reference transcript. 
This alignment is obtained by the alignment of 
audio data with the decoded text 38 (as described 
in more detail in Figure 4 below). 

In the next step, the aligned audio-transcript 
data passes to the block 54 where the aligned 
audio-transcript data is aligned against the video 
data 12. This video data 12 is received from the 
deck 19. The alignment of video data with audio- 
transcript data is done in accordance with timing 
information about audio and video data (which was 
produced by the timing block 16). As mentioned 
above, this information is stored on the tape 19. 

The whole aligned audio-video-transcript data 
from the block 54 goes to the indexing block 60 
where the indexing is done in accordance with the 
transcript in text store 24 by choosing key words or 
phrases. The indexing data in 60 can be monitored 
and retrieved from the monitor 62. Monitor 62 may 
be a computer terminal through which a user can 
read text, watch a video, hear audio, or manipulate 
the observed information. 

The work of the decoder 34 is controlled by 
the segmentation block 32. The block 32 receives 
control parameters from the mapping block 44. 
These parameters include the size of the text, 
grammatical structures for the text, etc. These pa- 
rameters are used to determine (1) the size of 
speech segment from 19 to be passed to the
decoder 34, and (2) dictionary and grammar constraints.

This completes the description of the indexing 
of the audio-video data by the transcript data in 
accordance with Figure 3. 

Next comes the description of the mapping 
block 44 in Figure 4. The mapping block 44 re- 
ceives as input data a transcript 24 and a decoded 
text 38 (Figure 2). The transcript 24 goes to the
block 201 where it is partitioned into texts of smaller
sizes Ti (i = 1, 2, ..., k). Each partition may be, for
example, a sequence of 10-15 words. If the decoder
34 is designed to handle a large vocabulary,
each partition may be 100-1000 words. Preferably,
partitions end on a period or other indication of the
end of a sentence, to avoid splitting a sentence
between two partitions. After this operation, the work
is performed as follows. The parameter i in the
block 211 starts with the initial value i = 1. The text
T1 is copied to the block 202 and processed as
described below. The successive fragments T2, ...,
Ti, ... are processed in the same manner. (After the
alignment of Ti with the corresponding part of the
tape is finished, the value of i is increased by 1 and
the procedure is performed for T(i+1).)
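
The partitioning of the transcript into segments Ti that end on sentence boundaries could look roughly as follows. This is a sketch under assumptions: sentence ends are detected with a simple punctuation rule, and the name `partition_transcript` and the `target_words` parameter are illustrative, not part of the described apparatus.

```python
import re
from typing import List


def partition_transcript(text: str, target_words: int = 12) -> List[str]:
    """Split a transcript into partitions of roughly target_words words.

    A partition is closed only at the end of a sentence, so that no sentence
    is split between two partitions.
    """
    # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    partitions: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        current.append(sentence)
        count += len(sentence.split())
        if count >= target_words:
            partitions.append(" ".join(current))
            current, count = [], 0
    if current:  # remaining sentences form the last, possibly shorter, partition
        partitions.append(" ".join(current))
    return partitions


if __name__ == "__main__":
    sample = ("The court is in session. The witness was sworn. "
              "Counsel asked about the night of the fire. No further questions.")
    for i, part in enumerate(partition_transcript(sample, target_words=8), start=1):
        print(f"T{i}: {part}")
```

For a large-vocabulary decoder, the same routine could be called with a `target_words` value in the 100-1000 range mentioned in the text.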



After text Ti is copied to the block 202 it can 
be processed with one of the following options. 

1) If the text is handwritten, it is passed to the
automatic handwriting recognizer 213, where it
is decoded, and the decoded text is sent to the
selection block 215 together with confidence
scores. These scores reflect the likelihood of
handwritten strokes being correctly decoded.
There are many different methods for computing
confidence scores. For example, given handwritten
strokes HW, let there be a few candidates
W1, W2, W3, ... among which the search for the best
match is conducted. Let L(W1, HW), L(W2, HW), ...
be likelihood scores that measure
the degree of match of the strokes HW to words
W1, W2, ..., respectively. Then the sharpness of
the normalized peak at the maximum likelihood
score could represent the level of confidence
that the handwritten strokes are decoded correctly.

In the selector block 215 words with confidence
scores higher than a predetermined
threshold are selected, and numbered in accordance
with their place in the handwritten transcript.
All this information is sent to the format
block 216 where the following operations are
performed.

a) The words are marked with labels representing
information about their places in the
transcript. The set of (word, label) pairs is
formatted in block 216 (for example, in ASCII),
and the formatted information is sent to
the comparison block 209. In this comparison
block 209, the transcript words (index words)
will be compared with the formatted decoded
text 38 received from automatic speech recognizer
34.

b) The information about the size of the list Ti
(either approximate or exact number of words
or lines in the file containing Ti) is sent to the
segmenter block 32 (Figure 2). The transcript
Ti is sent to the automatic speech recognizer
34.

2) If the transcript in text store 24 was produced
by scanning typed materials (e.g. books, faxes,
etc.), the transcript file is sent to an automatic
character recognizer (OCR) in block 214. Decoded
output from block 214 (decoded words,
confidence scores) is sent to the selection block
215A with a similar procedure as described in 1)
above.

3) The transcript is represented in a format such
as ASCII or BAUDOT characters. In this case,
the transcript is sent to the format block 216 and
then is processed as in case 1).
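
The confidence scoring and threshold selection described in options 1) and 2) can be sketched as follows. The text leaves the exact measure open ("many different methods"), so the choice here, the share of the total likelihood held by the best candidate, is only one assumed way to express the sharpness of the normalized peak. All names are hypothetical.

```python
from typing import Dict, List, Tuple


def confidence(likelihoods: Dict[str, float]) -> Tuple[str, float]:
    """Return the best candidate word and its normalized likelihood.

    likelihoods maps candidate words W1, W2, ... to positive scores L(Wi, HW).
    """
    total = sum(likelihoods.values())
    if total <= 0:
        raise ValueError("likelihood scores must be positive")
    best = max(likelihoods, key=likelihoods.get)
    # Assumption: a sharp peak means the best candidate dominates the total.
    return best, likelihoods[best] / total


def select_confident(decoded: List[Dict[str, float]], threshold: float) -> List[Tuple[str, int]]:
    """Keep words whose confidence exceeds the threshold, labeled by position."""
    selected = []
    for position, candidates in enumerate(decoded, start=1):
        word, score = confidence(candidates)
        if score > threshold:
            selected.append((word, position))  # (word, label) pair for block 216
    return selected


if __name__ == "__main__":
    decoded = [
        {"dog": 0.9, "dig": 0.1},
        {"step": 0.4, "stop": 0.35, "steep": 0.25},
        {"house": 0.8, "horse": 0.2},
    ]
    print(select_confident(decoded, threshold=0.6))
```

The second word is dropped because its likelihood is spread over several candidates, while its position number is preserved for the words around it.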

The following is the description of the procedure
in the block 209. This block receives recursively
the decoded text DTi that corresponds to the
text Ti. The decoded text DTi is aligned against its
reference text Ti (starting recursively from i = 1).

There are three main cases in which this alignment is
done.

Case 1: The transcript Ti is an exact repre- 
sentation of the recorded speech. 

In this case, words in DTi are considered 'cor- 
rect' if they coincide with words in Ti that have the 
same context (occur approximately in the same 
'places') as corresponding words in DTi. In order to 
define whether two equal words have the same 
context (occur in the same "place"), one can pro- 
ceed as follows. 

1) Words in Ti and DTi can be enumerated.
Say, words in DTi are numbered as
DW1, DW2, ... as they appear in the text. Similarly,
words in the text Ti can be numbered as
W1, W2, W3, .... Words DWi and Wj could be
considered as approximately at the same place
if |i - j| < d, where the threshold value d is a
small number (say, 3 or 4). If the words DWi
and Wj are approximately at the same place and
are equal (i.e. are the same character strings)
then they could be considered as matched.

2) Use the methods that are described in the
paper by P.F. Brown, et al. entitled "Aligning
Sentences in Parallel Corpora" (Proceedings of the
29th Annual Meeting of the Association for Computational
Linguistics, Berkeley, California, 1991,
pages 169-176).

The first algorithm can be improved as fol- 
lows. 

a) If words DWi and Wj that are equal (as
strings of characters) have large length (defined
as number of characters), then the distance
d = |i - j| can be comparatively large
(say, d = 5 or 6). At the same time, if these
words are short (for example DWi is 'a' or
'the') then d should be small (1 or 2).

b) If equal words DWi and Wj have low
frequency of occurrence (that is measured as
the frequency with which these words occur
in large corpora of texts) then d = |i - j| can
be chosen larger. On the other hand, if DWi,
Wj are very frequent words (like 'a', 'if', 'no')
then d = |i - j| should be chosen smaller.

c) If pairs of words DW(i-1), DWi and W(j-1),
Wj are equal then the difference d = |i - j|
can be chosen larger. Similarly, if trigrams
DW(i-2), DW(i-1), DWi and W(j-2), W(j-1), Wj are
equal then d = |i - j| can be increased
further. Similar increases in admissible values
of d = |i - j| can be considered for other n-grams
with increasing n.

d) Comparison of words in DTi with words in
Ti is done only for those words in DTi that
have confidence scores higher than some
threshold. (Confidence scores of decoded
words were discussed in 1) above.)

3) Words in DTi can be indexed by acoustic
frames during decoding, i.e. the beginning and
end of each word in DTi correspond to some set
of acoustic frames (see Algorithm A above). One
can approximately align the text Ti against the
string of frames by defining the average speed
as the ratio of the number of words in Ti to the
number of frames on the tape used in dictating
this text. Then the product of this speed by the
number of words preceding a given word defines
its approximate relative place in the text.

4) Align the string of frames with a string of
phonemes using known continuous speech recognition
algorithms (e.g. H.C. Leung, V.W. Zue,
"A Procedure For Automatic Alignment of Phonetic
Transcriptions With Continuous Speech",
Proceedings of ICASSP 84, pages 2.7.1 - 2.7.3,
1984). Match this phonetic string with the decoded
text DTi via the correspondence with the
acoustic frame string. Use rules (or a table) to
produce a phonetic string for the reference text
Ti. Then consider words in texts DTi and Ti as
being in the same place if they are surrounded
by similar phonetic substrings.
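
The position-based matching of method 1), together with improvement a), can be sketched as below. The specific length cut-offs and threshold values are assumptions picked from the ranges the text mentions, the greedy nearest-first search is an illustrative choice, and word positions are 0-based here rather than 1-based.

```python
from typing import List, Tuple


def admissible_distance(word: str) -> int:
    """Threshold d for a word; longer words tolerate a larger |i - j|.

    Assumption: the cut-offs (8 and 3 characters) are illustrative only.
    """
    if len(word) >= 8:
        return 6
    if len(word) <= 3:
        return 2
    return 4


def match_words(decoded: List[str], transcript: List[str]) -> List[Tuple[int, int]]:
    """Pair decoded word i with transcript word j when the words are equal
    character strings and |i - j| < d. Returns (i, j) index pairs."""
    pairs: List[Tuple[int, int]] = []
    used = set()
    for i, word in enumerate(decoded):
        d = admissible_distance(word)
        window = range(max(0, i - d + 1), min(len(transcript), i + d))
        # Try the closest transcript positions first.
        for j in sorted(window, key=lambda j: abs(i - j)):
            if j not in used and transcript[j] == word:
                pairs.append((i, j))
                used.add(j)
                break
    return pairs


if __name__ == "__main__":
    decoded = ["the", "production", "off", "wheat", "rose"]
    transcript = ["the", "annual", "production", "of", "wheat", "rose"]
    print(match_words(decoded, transcript))
```

In this example the misrecognized word "off" stays unpaired, which is the situation the non-paired tagging described earlier is meant to handle.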

Case 2: 

1. In this case the algorithm is similar to the 
30 algorithm in the case 1 , except that some rules 
above are loosened. The following are examples 
of modifications of the above algorithms. 

a) Words from DTi and Ti are considered to
be matched if they are approximately at the
same place (in the sense of the definition in
2. above) and their length (i.e. the number of
characters from which they are composed) is
large (e.g. 5 or 7). The exact length of words
that are allowed to be compared depends on the
level of approximation of the text Ti. The more
loosely Ti approximates the recorded speech,
the longer the words required to obtain a
match should be.

b) Insertions, omissions and changes in order
are allowed in the comparison of n-grams of
words, as in c). For example, the trigram
W(j-2)W(j-1)Wj in DTi can be matched against
the five-gram V(j-3)V(j-2)V(j-1)VjV(j+1) from
Ti if W(j-2) = V(j-3), W(j-1) = V(j-1) and
Wj = Vj, and if the matched words have
sufficient lengths. In this example, the other
words V(j-2) and V(j+1) from Ti could be
considered as insertions.

Similarly, if n-grams in DTi and Ti are equal
after interchanging the order of the words, then
the corresponding words could be considered as
matched.
2. For each word segment in the speech data,
compare the score of the aligned word from the 
provided script Ti with the score of a decoded 
word in DTi for that speech segment. Insert or 
replace the script word with the decoded word if 
the difference satisfies a specified threshold. 
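
The loosened n-gram comparison of rule 1 b) above (Case 2) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented procedure: the function names and the default minimum length of 5 characters are hypothetical, and insertions are handled by a simple left-to-right subsequence search.

```python
def match_with_insertions(decoded_ngram, script_window, min_len=5):
    """Match decoded words, in order, against a longer script window.

    Script words that are skipped over are treated as insertions.
    Returns (decoded position, script position) pairs, or None when some
    decoded word is too short or cannot be found in order.
    """
    pairs = []
    k = 0
    for pos, word in enumerate(decoded_ngram):
        if len(word) < min_len:  # only sufficiently long words may match
            return None
        while k < len(script_window) and script_window[k] != word:
            k += 1  # skip an inserted script word
        if k == len(script_window):
            return None
        pairs.append((pos, k))
        k += 1
    return pairs


def same_words_any_order(ngram_a, ngram_b):
    """True if the two n-grams are equal after reordering their words."""
    return sorted(ngram_a) == sorted(ngram_b)


trigram = ["automatic", "speech", "recognizer"]
five_gram = ["automatic", "continuous", "speech", "recognizer", "system"]
pairs = match_with_insertions(trigram, five_gram)
```

Here the trigram lines up with positions 0, 2 and 3 of the five-gram, mirroring the example in which V(j-2) and V(j+1) are regarded as insertions.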

Case 3. Speech data and a summary script are
available

a) Identify words in DTi with high confidence
scores (anchor points).

b) Align the anchor point sequence to the
available summary script. In this case, the summary
script and the speech data most often do not
have time correlation. This is so because, in
preparing a summary, the author could
rearrange the topics in the speech data at his
discretion. To this effect, the summary will be
broken into sentences and then all anchor points
will be matched against all summary sentences.
According to a threshold mechanism, an anchor
point will be mapped to none, one, or many
sentences. A sentence would also be mapped to
none, one, or many anchor points. (A hidden
Markov model for the production of summary
sentences from anchor points using semantic
similarity is trained and used for Viterbi
alignment.)

c) Use the alignment result to break the sum- 
mary into segments each associated with an 
anchor point. Since the anchor points carry time 
stamps, we achieve a time alignment between 
the summary script and the speech data. 

d) Repeat this process on the subsegments that 
can still be broken into smaller parts. 
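
Steps b) and c) of Case 3 can be illustrated with a simplified Python sketch. It deliberately replaces the hidden Markov model and semantic similarity mentioned above with plain word overlap, which is an assumption made only to keep the example short. The names `tokenize` and `align_summary`, the threshold of 0.5, and the choice of the earliest anchor time as a sentence's time stamp are likewise hypothetical.

```python
import re


def tokenize(text):
    """Lower-case a text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def align_summary(anchors, summary, threshold=0.5):
    """Map anchor points to summary sentences by word overlap.

    anchors: list of (word or word cluster, time stamp) pairs.
    Returns the sentences, a sentence -> anchor indices mapping, and one
    time stamp per sentence (None when no anchor maps to it).
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    mapping = {k: [] for k in range(len(sentences))}
    for a, (phrase, _time) in enumerate(anchors):
        words = tokenize(phrase)
        if not words:
            continue
        for k, sentence in enumerate(sentences):
            present = set(tokenize(sentence))
            score = sum(w in present for w in words) / len(words)
            if score >= threshold:  # threshold mechanism: none, one or many
                mapping[k].append(a)
    times = [min((anchors[a][1] for a in mapping[k]), default=None)
             for k in range(len(sentences))]
    return sentences, mapping, times


anchors = [("budget deficit", 12.0), ("election results", 95.5),
           ("weather", 300.0)]
summary = ("The election results surprised everyone. "
           "The budget deficit grew again.")
sentences, mapping, times = align_summary(anchors, summary)
```

In this example the summary mentions the topics in a different order from the recording, so the first sentence receives the later time stamp; the anchor "weather" maps to no sentence at all.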

The following is an explanation of Figure 5. The
block 401 contains a decoded text (ordered series
of recognized words) DT that is schematically
represented by a vertical left series of words
1, 2, 3, ... 8 and a transcript T that is
schematically represented by a vertical right
series of words 1', 2', 3', ... 7'. The pairs of
words (1,1'), (4,5'), (8,7') were matched as
described in Figure 4. The series of words
1, 2, ... 8 is aligned against audio data (block
402) in the course of decoding (Figure 3 block 42),
as schematically shown inside block 402. Let
(T0, T1), (T1, T2), ... (T7, T8) correspond to the
beginnings and ends of words 1, 2, 3, ... 8,
respectively. Then the matched transcript words
1', 5', 7' will correspond to time data (T0, T1),
(T3, T4), (T7, T8), respectively (via the matched
decoded words).

Remaining decoded words can be aligned with
the time data by linear interpolation. For example,
time segment (T1, T3) corresponds to the word
segment W2, W3 and can be aligned in accordance
with the length of words. For example, if W2
consists of N phonemes and W3 of M phonemes,
and t = T3 - T1, then the segment
[T1, T1 + t*N/(N + M)] corresponds to W2, and
the segment [T1 + t*N/(N + M), T3] corresponds
to W3.
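
The interpolation just described generalizes to any number of unmatched words between two matched ones. The following Python sketch shows one way to compute it; the function name `interpolate_segments` and the representation of times as plain numbers are assumptions for illustration.

```python
def interpolate_segments(start, end, phoneme_counts):
    """Split the time span (start, end) among consecutive words.

    Each word receives a share of the span proportional to its number of
    phonemes, as in the W2/W3 example with N and M phonemes.
    """
    total = sum(phoneme_counts)
    if total <= 0:
        raise ValueError("at least one phoneme is required")
    t = end - start
    segments = []
    cumulative = 0
    for count in phoneme_counts:
        seg_start = start + t * cumulative / total
        cumulative += count
        seg_end = start + t * cumulative / total
        segments.append((seg_start, seg_end))
    return segments


# W2 has N = 3 phonemes, W3 has M = 1 phoneme, T1 = 1.0 and T3 = 3.0.
segments = interpolate_segments(1.0, 3.0, [3, 1])
```

With these values t = 2.0, so W2 is assigned the segment from T1 to T1 + t*N/(N + M) and W3 the remainder up to T3.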

The aligned transcript-audio data is transferred
to the block 403 where it is aligned with video data
from the record/playback deck 19 of Figure 3. This
alignment is obtained by time stamping that was
done for audio-video data.

The following is an explanation of Figure 6 in
which the speech recognizer vocabulary is
obtained from segments of the text transcript. The
block 501 contains the current part of a transcript
Ti that is processed. This part of the transcript Ti is
used to derive the vocabulary V 504 from which
the text in Ti was formed, and the approximate size
503 of the tape section 505 that contains the
speech that corresponds to Ti. The size can be
obtained by estimating for each word Wr in Ti the
maximum possible size Dr of its corresponding
audio data on the tape, and taking the sum of all Dr
(r = 1, 2, ...) in a segment as the length of a segment
in the tape.
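
The size estimate above is a sum of per-word upper bounds. The source does not say how each Dr is obtained, so the following Python sketch assumes, purely for illustration, that Dr is proportional to the number of characters in the word, with a floor for very short words. The function name and both constants (measured in acoustic frames) are hypothetical.

```python
def estimate_segment_frames(words, max_frames_per_char=15,
                            min_frames_per_word=30):
    """Upper-bound the audio length of a transcript segment, in frames.

    Dr for each word is assumed to be its character count times
    max_frames_per_char, but never less than min_frames_per_word.
    The segment length is the sum of all Dr.
    """
    return sum(max(min_frames_per_word, len(word) * max_frames_per_char)
               for word in words)


frames = estimate_segment_frames(["a", "recording"])
```

Because each Dr is a maximum, the estimated segment tends to overshoot the true end of the speech for Ti, which suits its use as the length of tape to be played to the recognizer.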

This information is transferred to the block 502
where the following tasks are performed. The end
of the audio segment on the tape that corresponds
to the previous T(i-1) text (or the beginning of the
tape for the first T1 segment) is identified, and the
next segment of the tape, with the length that is
provided from the block 501, is played to the
automatic speech recognizer 506. The automatic
speech recognizer decodes this audio data using
the vocabulary that was provided from the block
501. The automatic speech recognizer sends each
decoded series of words W1, W2, ... Wk to the
block 501, where it is compared with the text Ti.
If the decoded series of words matches well with
the corresponding part V1, V2, ... Vl in Ti, then
the next word V(l+1) is added to the list of
alternative words the automatic speech recognizer
is processing in decoding the corresponding
segment of audio data. (This candidate word
V(l+1) could be given an additional score that
represents the likelihood of being the next word
in the considered path.) After the whole text Ti is
decoded, the end of the tape audio data that
corresponds to the end of the text is defined. This
end of the audio segment is transferred to the next
step (decoding of the T(i+1) part of the text) if Ti
is not the last segment in T.

Claims

1. An apparatus for indexing an audio recording 
comprising: 

an acoustic recorder for storing an ordered
series of acoustic information signal units
representing sounds generated from spoken
words, said acoustic recorder having a plurality
of recording locations, each recording location
storing at least one acoustic information signal
unit;

a speech recognizer for generating an ordered
series of recognized words having a high
conditional probability of occurrence given the
occurrence of the sounds represented by the
acoustic information signals, each recognized
word corresponding to at least one acoustic
information signal unit, each recognized word
having a context of at least one preceding or
following recognized word;

a text storage device for storing an ordered
series of index words, said ordered series of
index words comprising a visual representation
of at least some of the spoken words
represented by the acoustic information signal units,
each index word having a context of at least
one preceding or following index word; and

means for comparing the ordered series of
recognized words with the ordered series of
index words to pair recognized words and
index words which are the same word and which
have matching contexts, and for tagging each
paired index word with the recording location
of the acoustic information signal unit
corresponding to the recognized word paired with
the index word.

2. An apparatus as claimed in Claim 1,
characterized in that the speech recognizer
aligns each recognized word with at least one
acoustic information signal unit.

3. An apparatus as claimed in Claim 1 or 2,
characterized in that:

each recognized word which is not paired with
an index word has a nearest preceding paired
recognized word in the ordered series of
recognized words, and has a nearest following
paired recognized word in the ordered series
of recognized words;

the context of a target recognized word
comprises the number of other recognized words
preceding the target recognized word and
following the nearest preceding paired
recognized word in the ordered series of recognized
words;

the context of a target index word comprises
the number of other index words preceding the
target index word and following the nearest
preceding paired index word in the ordered
series of index words; and

the context of a recognized word matches the
context of an index word if the context of the
recognized word is within a selected threshold
value of the context of the index word.



4. A method of indexing an audio recording com- 
prising: 

storing an ordered series of acoustic informa- 
tion signal units representing sounds gener- 
ated from spoken words, said acoustic record- 
er having a plurality of recording locations, 
each recording location storing at least one 
acoustic information signal unit; 
generating an ordered series of recognized 
words having a high conditional probability of 
occurrence given the occurrence of the sounds 
represented by the acoustic information sig- 
nals, each recognized word corresponding to 
at least one acoustic information signal unit, 
each recognized word having a context of at 
least one preceding or following recognized 
word; 

storing an ordered series of index words, said 
ordered series of index words comprising a 
visual representation of at least some of the 
spoken words represented by the acoustic in- 
formation signal units, each index word having 
a context of at least one preceding or following 
index word; 

comparing the ordered series of recognized 
words with the ordered series of index words 
to pair recognized words and index words 
which are the same word and which have 
matching contexts; and 

tagging each paired index word with the re- 
cording location of the acoustic information 
signal unit corresponding to the recognized 
word paired with the index word. 

5. A method as claimed in any one of the preced- 
ing claims, characterized in that: 

each recognized word comprises a series of 
one or more characters; 

each index word comprises a series of one or 
more characters; and 

a recognized word is the same as an index 
word when both words comprise the same 
series of characters. 

6. A method as claimed in any one of the preced- 
ing claims, characterized in that: 

the context of a target recognized word com- 
prises the number of other recognized words 
preceding the target recognized word in the 
ordered series of recognized words; 
the context of a target index word comprises 
the number of other index words preceding the 
target index word in the ordered series of 
index words; and 

the context of a recognized word matches the 
context of an index word if the context of the 
recognized word is within a selected threshold 
value of the context of the index word. 
7. A method as claimed in any one of the
preceding claims, characterized in that:

each index word which is not paired with a
recognized word has a nearest preceding
paired index word in the ordered series of
index words, and has a nearest following
paired index word in the ordered series of
index words; and

the step of tagging comprises tagging a
non-paired index word with a recording location
between the recording location of the nearest
preceding paired index word and the recording
location of the nearest following paired index
word.

8. A method as claimed in any one of the preced- 
ing claims, further comprising the step of align- 
ing each recognized word with at least one 
acoustic information signal unit. 

9. A method as claimed in any of the preceding 
claims, characterized in that: 

each recognized word which is not paired with
an index word has a nearest preceding paired
recognized word in the ordered series of
recognized words, and has a nearest following
paired recognized word in the ordered series
of recognized words;

the context of a target recognized word
comprises the number of other recognized words
preceding the target recognized word and
following the nearest preceding paired
recognized word in the ordered series of recognized
words;

the context of a target index word comprises
the number of other index words preceding the
target index word and following the nearest
preceding paired index word in the ordered
series of index words; and

the context of a recognized word matches the
context of an index word if the context of the
recognized word is within a selected threshold
value of the context of the index word.